Optimal Column-Based Low-Rank Matrix Reconstruction



Venkatesan Guruswami    Ali Kemal Sinop

Computer Science Department 
Carnegie Mellon University 
Pittsburgh, PA 15213.



Abstract 

We prove that for any real-valued matrix $X \in \mathbb{R}^{m \times n}$ and
positive integers $r \ge k$, there is a subset of $r$ columns
of $X$ such that projecting $X$ onto their span gives a
$\sqrt{\frac{r+1}{r-k+1}}$-approximation to the best rank-$k$ approximation of $X$
in Frobenius norm. We show that the trade-off we achieve
between the number of columns and the approximation
ratio is optimal up to lower order terms. Furthermore,
there is a deterministic algorithm to find such a subset of
columns that runs in $O(rnm^{\omega} \log m)$ arithmetic operations,
where $\omega$ is the exponent of matrix multiplication. We also
give a faster randomized algorithm that runs in $O(rnm^2)$
arithmetic operations.

1 Introduction 

Given a matrix $X \in \mathbb{R}^{m \times n}$ and a positive integer $k < n$,
the best rank-$k$ approximation to $X$ is given by the top $k$
singular vectors of $X$:

$$X_{(k)} = \sum_{i=1}^{k} \sqrt{\sigma_i}\, u_i v_i^T$$

where $\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_n \ge 0$ are the eigenvalues
of $X^T X$, and $u_i$ (resp. $v_i$) are the associated
left (resp. right) singular vectors for each singular
value $\sqrt{\sigma_i}$. Furthermore, $X_{(k)}$ can be computed in
$O(\min(n, m)\,mn)$ time using the Singular Value Decomposition (SVD).

One related question that has received considerable
attention in recent years is choosing $r$ columns of $X$, for
some input parameter $r \ge k$, whose span approximates
$X$ nearly as well as $X_{(k)}$. In other words, we would
like to relate

$$\min_{C \in \binom{[n]}{r}} \|X - X_C^{\Pi} X\|_{\xi} = \min_{C \in \binom{[n]}{r}} \|X_C^{\perp} X\|_{\xi}$$

to $\|X - X_{(k)}\|_{\xi}$ for some norm $\xi$, and efficiently find
a subset $C$ of $r$ columns coming close to this bound.
Here $X_C$ denotes the matrix formed by the columns of $X$
corresponding to $C$, and $X_C^{\Pi}$ (resp. $X_C^{\perp}$) is the projection
matrix onto the span of $X_C$ (resp. onto the orthogonal complement of that span).

This basic problem seems well-motivated in vari- 
ous application settings. For example, this problem has 
applications in data sets arising from document clas- 
sification problems, face recognition tasks, and so on, 
where it is important to pick a subset of features that 
are dominant (and it is not appropriate to work with lin- 
ear combinations of features output by usual dimension 
reduction techniques like random projection or singu- 
lar value decomposition). We refer the reader to Ma- 
honey and Drineas [10] for comparisons of SVD and 
column selection on experimental data. 

Our interest in this problem stemmed from our
own work on improved approximation algorithms using
certain Semidefinite Programming relaxations from
the so-called Lasserre Hierarchy [8]. The analysis of
our algorithms relied on bounding quantities such as
$\min_{C \in \binom{[n]}{r}} \|X_C^{\perp} X\|_F$ for the Frobenius norm. In this
application, the running time is exponential in $r$, where
$r$ is the number of columns one has to choose from $X$
to approximate $X$ in Frobenius norm as close to $X_{(k)}$ as
possible. Thus finding the optimal dependence between
$r$ and $k$ was a question of natural significance.

Our main results in this paper are the following two
theorems. We are able to get the best known dependence
between $r$ and $k$, show its optimality up to lower
order terms, and achieve this with an efficient deterministic
algorithm (Theorem 1.1). This answers one
of the open questions mentioned in pp. We are also 
able to give a more efficient randomized algorithm, via
a faster implementation of exact volume sampling (Theorem 1.2).
The deterministic algorithm of Theorem 1.1
is a derandomization of the volume sampling algorithm
via conditional expectations [4].

Theorem 1.1. Given $X \in \mathbb{R}^{m \times n}$ and positive integers
$k \le r$, one can find a set $C$ of $r$ columns, deterministically
using at most $O(rnm^{\omega} \log m)$ arithmetic
operations (where $\omega$ is the exponent of matrix multiplication),
such that

$$\|X - X_C^{\Pi} X\|_F^2 \le \frac{r+1}{r+1-k}\, \|X - X_{(k)}\|_F^2. \tag{1.1}$$

Furthermore, for any $r = o(n)$, this bound is tight up to
lower order terms.

Theorem 1.2. Given a matrix $X \in \mathbb{R}^{m \times n}$, $m \le n$,
and $r \ge 1$, there is an algorithm Volume-Sample
that samples a subset of $r$ columns of $X$, $C \in \binom{[n]}{r}$,
with probability

$$\frac{|X_C^T X_C|}{\sum_{T \in \binom{[n]}{r}} |X_T^T X_T|}$$

using at most $O(rnm^2)$ arithmetic operations. For every $k \le r$, the subset $C$
returned by Volume-Sample satisfies

$$\mathbb{E}_C\left[\|X - X_C^{\Pi} X\|_F^2\right] \le \frac{r+1}{r+1-k}\, \|X - X_{(k)}\|_F^2.$$

Note that $\|X - X_C^{\Pi} X\|_F^2 = \|X_C^{\perp} X\|_F^2 = \mathrm{Tr}(X^T X_C^{\perp} X)$.
Henceforth in this paper, we will use the trace notation.



1.1 Relation to previous work The first algorithm
for $k$-column matrix reconstruction was given in a
seminal paper by Frieze, Kannan and Vempala [7],
where they presented a randomized algorithm to find
$\mathrm{poly}(k, 1/\epsilon, 1/\delta)$ columns that achieve an additive error
of $\epsilon \|X\|_F$.

Subsequent works concentrated on removing the additive
factor and getting multiplicative (or relative error)
guarantees, and improving the dependence between
$r$ and $k$ to get a desired relative error. Some of these
works are mentioned in Figure 1. In the table, $r$ is the
number of columns needed so as to obtain the given
approximation ratio, defined as
$\mathrm{Tr}(X^T X_C^{\perp} X) / \|X - X_{(k)}\|_F^2$.

To briefly place our result in context, let us mention
the known existential bounds on the relation between
$r$, $k$, and the ratio achieved. Deshpande et al. [5] prove
the existence of $k$ columns achieving a ratio $k+1$, and
also show that this is best possible up to lower order
terms. Deshpande and Vempala [6] prove that for small
$\epsilon > 0$, there exists a matrix $M$ for which the best
error achieved by a rank-$k$ matrix, whose columns are
restricted to belong to the span of $r > k/\epsilon$ columns
of $M$, is at least $1 + \epsilon - o(1)$ times the best rank-$k$
approximation.¹



¹ Although in [6] the lower bound is stated as $1 + \frac{\epsilon}{2} - o(1)$, the
actual lower bound they prove is stronger and equals $1 + \epsilon - o(1)$.



Until recently, even the best existential bound to
achieve a $(1 + \epsilon)$ approximation was super-linear in $k$.
In an independent and concurrent work, Boutsidis,
Drineas, and Magdon-Ismail [1] showed a bound of
r s» k + 2s along with a randomized algorithm to find 
such a subset of columns.³ Our main result proves that
$k/\epsilon + k - 1$ columns are sufficient, and further those
columns can be found in deterministic polynomial time.

The $(1 + \epsilon)$ approximation achieved in [1] holds
in the restricted model (in which the above-mentioned
$k/\epsilon$ lower bound of [6] applies) where one must find
a rank-$k$ approximation matrix contained in the span
of the chosen $r$ columns, whereas our approximating
matrix uses the full span of the chosen columns. So
our results and [1] are incomparable in this respect. We
stress though that even allowing for full column span,
no bounds on $r$ which were linear in $k$ were known till
recently, for achieving say a factor 2 approximation.
Further, we extend the lower bound in [6] to show that
even allowing for full column span, $r = k/\epsilon$ columns are
needed for a factor $(1 + \epsilon - o(1))$ approximation.

Note that our result gives the optimal $(k+1)$ factor
approximation (taking $\epsilon = k$) for $r = k$, and for $\epsilon \to 0$,
the near-optimal $(1 + \epsilon)$ factor for $r \approx k/\epsilon$, in a uniform
way. As for the algorithmic claim, recently Deshpande
and Rademacher [4] gave an efficient implementation of
volume sampling and a deterministic algorithm to find
a set of $k$ columns with approximation ratio $k+1$, thus
matching the bound of [5] algorithmically. We simply
bound the ratio achieved by this algorithm when it is
allowed to pick $r > k$ columns. In other words, the
algorithmic part of Theorem 1.1 follows from [4], given
our combinatorial bound.
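As a small illustration of the trade-off just described, the following sketch evaluates the ratio $\frac{r+1}{r+1-k}$ from Theorem 1.1 and the column count $k/\epsilon + k - 1$. The function names `ratio_bound` and `columns_for` are hypothetical helpers introduced here for illustration; they are not part of the paper, and rounding $k/\epsilon$ up to an integer is an assumption made for non-integral values.

```python
from fractions import Fraction
from math import ceil


def ratio_bound(r, k):
    """Upper bound (r+1)/(r+1-k) on the squared-Frobenius approximation ratio."""
    if not 1 <= k <= r:
        raise ValueError("need 1 <= k <= r")
    return Fraction(r + 1, r + 1 - k)


def columns_for(k, eps):
    """Smallest r with ratio_bound(r, k) <= 1 + eps, namely ceil(k/eps) + k - 1."""
    return ceil(Fraction(k) / Fraction(eps)) + k - 1


k = 5
print(ratio_bound(k, k))               # r = k recovers the factor k + 1
r = columns_for(k, Fraction(1, 2))     # eps = 1/2
print(r, ratio_bound(r, k))            # r columns give the factor 1 + eps
```

Exact rationals are used so that the boundary case (ratio exactly $1 + \epsilon$) is not blurred by floating-point rounding.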

Prior to our work, the fastest algorithm known
for exact volume sampling was given in [4] using
$O(rnm^{\omega} \log m)$ arithmetic operations. We give an
asymptotically faster sampling algorithm, by using binary
search to pick the lowest index column in the sampled
set with the correct marginal probability, and then
recursing to sample the remaining $r - 1$ columns.

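1.2 Our Techniques Our proof is based on the
following bound:

$$\min_{C \in \binom{[n]}{r}} \mathrm{Tr}(X^T X_C^{\perp} X) \le (r+1)\frac{S_{r+1}(\sigma)}{S_r(\sigma)} = \mathbb{E}_{C \sim C_r(X)}\left[\mathrm{Tr}(X^T X_C^{\perp} X)\right] \tag{1.2}$$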



³ The theorem statement in [1] mentions the weaker bound
$r \ge 10k/\epsilon$, but the sharper bound is given at the end of Section 4
of the paper.

Figure 1: Performance and running time of various column selection algorithms



where $C \sim C_r(X)$ denotes sampling $C$ with probability
proportional to the determinant of $X_C^T X_C$, $|X_C^T X_C|$, and
$S_r(\sigma)$ is the $r$-th symmetric function of $\sigma_1, \sigma_2, \dots, \sigma_n$.
The bound (1.2) already appears in the work of Deshpande
et al. [5], where sampling from $C_r(X)$ is called
"volume sampling."

Our main technical contribution is to use the Schur-concavity
of $\frac{S_{r+1}(\sigma)}{S_r(\sigma)}$ and the theory of majorization to
bound $\frac{S_{r+1}(\sigma)}{S_r(\sigma)}$ in terms of $\sum_{i \ge k+1} \sigma_i$. At an intuitive
level, the ratio $\frac{S_{r+1}(\sigma)}{S_r(\sigma)}$ should be larger when $\{\sigma_i\}_{i=1}^{n}$
is more "uniform." Majorization and Schur-concavity
allow us to turn this intuition into a precise and formal
statement. This leads us to the inequality

$$(r+1)\frac{S_{r+1}(\sigma)}{S_r(\sigma)} \le \frac{r+1}{r+1-k} \sum_{i > k} \sigma_i \tag{1.3}$$

which together with $\|X - X_{(k)}\|_F^2 = \sum_{i > k} \sigma_i$ and (1.2)
yields the claimed bound (1.1). For the nearly matching
lower bound, we prove that for the construction given
in [6], the lower bound on approximation ratio holds
even in the unrestricted model where the full column
span of the $r$ columns is allowed; this analysis appears
in Section 6.

As for the algorithm, Deshpande and
Rademacher [4] used the method of conditional
expectations to find $C \in \binom{[n]}{r}$ satisfying

$$\mathrm{Tr}(X^T X_C^{\perp} X) \le \mathbb{E}_{C \sim C_r(X)}\left[\mathrm{Tr}(X^T X_C^{\perp} X)\right]$$

deterministically using $O(rnm^{\omega} \log m)$ operations. Together with
our bound (1.3), this implies a deterministic algorithm
achieving a $\frac{r+1}{r+1-k}$ ratio. In light of this, we do not
discuss the deterministic part any further in this paper,
and focus on proving (1.3) and (1.2), which we do
in Sections 3 and 4 respectively. Our more efficient
volume sampling algorithm is described in Section 5.
The proof of our lower bound is presented in Section 6.

2 Preliminaries and Notation

For any positive integer $n$, we use $[n] = \{i \in \mathbb{N} : i \le n\}$
to denote the set of positive integers smaller than or
equal to $n$. We will use $\binom{A}{k}$ to denote the $k$-subsets of
$A$.

Given a real vector $a = (a_i)_{i=1}^{n} \in \mathbb{R}^n$, we will use $a_{\uparrow i}$
(resp. $a_{\downarrow i}$) to denote the $i$-th smallest (resp. largest)
element of $\{a_i\}_i$.

We say $a = (a_i)_{i=1}^{n} \in \mathbb{R}^n$ majorizes $b = (b_i)_{i=1}^{n} \in \mathbb{R}^n$
if for all $j \in [n]$, $\sum_{j' \le j} a_{\downarrow j'} \ge \sum_{j' \le j} b_{\downarrow j'}$ and
$\sum_j a_j = \sum_j b_j$. We denote this relation by $a \succ b$.

Observation 2.1. For any non-negative vector $a \in \mathbb{R}^n_{\ge 0}$,
the following holds:

$$(1, 0, \dots, 0) \succ \frac{a}{\sum_j a_j} \succ \left(\frac{1}{n}, \frac{1}{n}, \dots, \frac{1}{n}\right).$$



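Definition 2.1. A function $F : \mathbb{R}^n \to \mathbb{R}$ is called
Schur-concave if whenever $a \in \mathbb{R}^n$ majorizes $b \in \mathbb{R}^n$,
$a \succ b$, we have $F(a) \le F(b)$.

Definition 2.2. (Symmetric polynomials) For a
given $\sigma = (\sigma_1, \dots, \sigma_n) \in \mathbb{R}^n$, let $S_r(\sigma)$ denote the $r$-th
symmetric polynomial:

$$S_r(\sigma) = \sum_{U \in \binom{[n]}{r}} \prod_{i \in U} \sigma_i.$$

Likewise, for a given square matrix $A \in \mathbb{R}^{m \times m}$, $S_r(A)$
is defined as

$$S_r(A) = \sum_{U \in \binom{[m]}{r}} |A_{U,U}|,$$

where $A_{U,U}$ is the minor of $A$ corresponding to columns
and rows in $U$.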


Lemma 2.1. If $A \in \mathbb{R}^{m \times m}$ has eigenvalues $\{\sigma_i\}$, then
$S_r(A) = S_r(\sigma)$.

Proof. The coefficient of $x^{m-r}$ in $\prod_i (\sigma_i - x)$ equals
$(-1)^{m-r} S_r(\sigma)$. Similarly, $(-1)^{m-r} S_r(A)$ is the coefficient
of $x^{m-r}$ in $|-xI + A|$. Now, note that
$|-xI + A| = \prod_i (\sigma_i - x)$.
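Lemma 2.1 can be checked by hand on a tiny case. The sketch below computes $S_r$ of a list of numbers with a standard dynamic program and $S_r$ of a matrix as the sum of its $r \times r$ principal minors, using exact rationals. The helper names (`elementary_symmetric`, `det`, `sum_principal_minors`) and the example matrix $2I + J$, whose eigenvalues $5, 2, 2$ are supplied by hand rather than computed, are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import combinations


def elementary_symmetric(values, r):
    """S_r(values) by dynamic programming: e[j] is S_j of the values seen so far."""
    e = [Fraction(1)] + [Fraction(0)] * r
    for v in values:
        for j in range(r, 0, -1):
            e[j] += e[j - 1] * v
    return e[r]


def det(mat):
    """Determinant by Gaussian elimination over exact fractions."""
    a = [[Fraction(v) for v in row] for row in mat]
    n = len(a)
    result = Fraction(1)
    for col in range(n):
        pivot = next((i for i in range(col, n) if a[i][col] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            result = -result
        result *= a[col][col]
        for i in range(col + 1, n):
            factor = a[i][col] / a[col][col]
            for c in range(col, n):
                a[i][c] -= factor * a[col][c]
    return result


def sum_principal_minors(A, r):
    """S_r(A): sum of determinants of all r x r principal submatrices of A."""
    idx = range(len(A))
    return sum((det([[A[i][j] for j in U] for i in U])
                for U in combinations(idx, r)), Fraction(0))


A = [[3, 1, 1], [1, 3, 1], [1, 1, 3]]   # 2I + J, eigenvalues 5, 2, 2
eigenvalues = [5, 2, 2]
for r in range(4):
    print(r, sum_principal_minors(A, r), elementary_symmetric(eigenvalues, r))
```

Both columns of the printed table should agree for every $r$, which is the content of the lemma on this example.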

Given a matrix $X \in \mathbb{R}^{m \times n}$ and $i \in [n]$, we use $X_i$
to denote the $i$-th column of $X$. Similarly, given a subset
of columns, $C \subseteq [n]$, we use $X_C$ to denote the matrix
formed by the columns from $C$, $X_C = (X_i)_{i \in C}$. Also, we will
let $X^{\Pi}$ and $X^{\perp}$ be the projection matrices onto the range of $X$
and onto its orthogonal complement (the null space of $X^T$), respectively.

For any square matrix $A \in \mathbb{R}^{m \times m}$, we will use $|A|$
to denote the determinant of $A$, $\mathrm{Tr}(A)$ to denote the trace
of $A$, and $\sigma_i(A)$ to denote the $i$-th largest eigenvalue of
$A$.

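Lemma 2.2. For any $A \in \mathbb{R}^{m \times r}$, if all $r$ columns of $A$
are linearly independent, then the squared distance of $x \in \mathbb{R}^m$
to the span of $A$ is given by

$$\|A^{\perp} x\|^2 = \frac{\begin{vmatrix} A^T A & A^T x \\ x^T A & x^T x \end{vmatrix}}{|A^T A|}.$$

Proof. Note that by elementary row operations

$$\begin{vmatrix} A^T A & A^T x \\ x^T A & x^T x \end{vmatrix}
= \begin{vmatrix} A^T A & A^T x \\ 0 & x^T x - x^T A (A^T A)^{-1} A^T x \end{vmatrix}
= |A^T A|\, |x^T A^{\perp} x| = |A^T A|\, \|A^{\perp} x\|^2,$$

where we used the fact that $A (A^T A)^{-1} A^T = A^{\Pi}$ and
$I - A^{\Pi} = A^{\perp}$.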
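To make the determinant formula of Lemma 2.2 concrete, here is a minimal sketch that evaluates the ratio of Gram determinants exactly. The helper names (`det`, `gram`, `squared_distance_to_span`) and the example vectors are assumptions for illustration; vectors are represented as plain Python lists.

```python
from fractions import Fraction


def det(mat):
    """Determinant by Gaussian elimination over exact fractions."""
    a = [[Fraction(v) for v in row] for row in mat]
    n = len(a)
    result = Fraction(1)
    for col in range(n):
        pivot = next((i for i in range(col, n) if a[i][col] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            result = -result
        result *= a[col][col]
        for i in range(col + 1, n):
            factor = a[i][col] / a[col][col]
            for c in range(col, n):
                a[i][c] -= factor * a[col][c]
    return result


def gram(cols):
    """Gram matrix of a list of column vectors."""
    return [[sum(Fraction(p) * q for p, q in zip(u, v)) for v in cols] for u in cols]


def squared_distance_to_span(cols, x):
    """Lemma 2.2: det(Gram(cols + [x])) / det(Gram(cols)), cols linearly independent."""
    return det(gram(cols + [x])) / det(gram(cols))


A_cols = [[1, 0, 0], [1, 1, 0]]   # two independent columns spanning the xy-plane
x = [2, -1, 3]
print(squared_distance_to_span(A_cols, x))   # squared z-coordinate of x
```

Since the two columns span the $xy$-plane, the squared distance of $x$ to their span is the square of its third coordinate, which gives an independent check of the formula.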

3 Bound on ratio of symmetric functions 

The following theorem was first proved in the classic
paper of Schur [13]. See also [11, Section 3]. We present
a different proof below.

Theorem 3.1. For any $\sigma \in \mathbb{R}^n_{\ge 0}$, the ratio
$\frac{S_{r+1}(\sigma)}{S_r(\sigma)}$ is Schur-concave.



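Proof. By Schur's criterion to establish Schur-concavity
of symmetric functions, it suffices to show that

$$(\sigma_i - \sigma_j) \underbrace{\left( \frac{\partial}{\partial \sigma_i} \frac{S_{r+1}(\sigma)}{S_r(\sigma)} - \frac{\partial}{\partial \sigma_j} \frac{S_{r+1}(\sigma)}{S_r(\sigma)} \right)}_{(*)} \le 0$$

for all $i, j$. Using the identities

$$\frac{\partial}{\partial \sigma_i} \frac{S_{r+1}(\sigma)}{S_r(\sigma)} = \frac{S_r(\sigma) S_r(\sigma \setminus \sigma_i) - S_{r+1}(\sigma) S_{r-1}(\sigma \setminus \sigma_i)}{S_r(\sigma)^2},$$

$$S_k(\sigma \setminus \sigma_i) = \sigma_j S_{k-1}(\sigma \setminus \{\sigma_i, \sigma_j\}) + S_k(\sigma \setminus \{\sigma_i, \sigma_j\}),$$

we have that

$$\begin{aligned}
(*)\, S_r(\sigma)^2 &= S_r(\sigma) \left[ S_r(\sigma \setminus \sigma_i) - S_r(\sigma \setminus \sigma_j) \right] - S_{r+1}(\sigma) \left[ S_{r-1}(\sigma \setminus \sigma_i) - S_{r-1}(\sigma \setminus \sigma_j) \right] \\
&= S_r(\sigma) (\sigma_j - \sigma_i) S_{r-1}(\sigma \setminus \{\sigma_i, \sigma_j\}) - S_{r+1}(\sigma) (\sigma_j - \sigma_i) S_{r-2}(\sigma \setminus \{\sigma_i, \sigma_j\}) \\
&= (\sigma_j - \sigma_i) \left[ S_r(\sigma) S_{r-1}(\sigma \setminus \{\sigma_i, \sigma_j\}) - S_{r+1}(\sigma) S_{r-2}(\sigma \setminus \{\sigma_i, \sigma_j\}) \right].
\end{aligned}$$

Note that if we can show that the expression

$$S_r(\sigma) S_{r-1}(\sigma \setminus \{\sigma_i, \sigma_j\}) - S_{r+1}(\sigma) S_{r-2}(\sigma \setminus \{\sigma_i, \sigma_j\})$$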

is non-negative, we are done. For r = 2, § r _2 = hence 
we will consider the case when r ^ 3. 

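We will do so by exhibiting a flow $f$ on a bipartite
graph with left nodes labeled with $L = \binom{[n]}{r+1} \times \binom{[n] \setminus \{i,j\}}{r-2}$
and right nodes labeled with $R = \binom{[n]}{r} \times \binom{[n] \setminus \{i,j\}}{r-1}$,
with the property that if there is a non-zero flow from
$(S,T) \in L$ to $(S',T') \in R$ then
$\prod_{a \in S} \sigma_a \prod_{b \in T} \sigma_b \le \prod_{a \in S'} \sigma_a \prod_{b \in T'} \sigma_b$,
and the total flow leaving any node on the
left is 1 whereas the total flow entering any node on the right is
at most 1.

Given $(S,T) \in L$, consider $U = S \setminus (T \cup \{i,j\}) \ne \emptyset$.
For each $k \in U$, we set

$$f_{(S,T),(S \setminus \{k\},\, T \cup \{k\})} = \frac{1}{|U|}.$$

By construction, this satisfies the following:

1. $\sum_{(S',T') \in R} f_{(S,T),(S',T')} = 1$.

2. $f_{(S,T),(S',T')} \left( \prod_{a \in S} \sigma_a \prod_{b \in T} \sigma_b - \prod_{a \in S'} \sigma_a \prod_{b \in T'} \sigma_b \right) \le 0$.

In order to prove that $\sum_{(S,T) \in L} f_{(S,T),(S',T')} \le 1$, note that if
$f_{(S,T),(S',T')} \ne 0$, then there exists some $k \in T' \setminus S'$
such that $T = T' \setminus \{k\}$ and $S = S' \cup \{k\}$. Hence
$|S' \setminus (T' \cup \{i,j\})| = |S \setminus (T \cup \{i,j\})| - 1$. Therefore

$$\sum_{(S,T) \in L} f_{(S,T),(S',T')} = \sum_{k \in T' \setminus S'} \frac{1}{|S' \setminus (T' \cup \{i,j\})| + 1} = \frac{|T' \setminus S'|}{|S' \setminus (T' \cup \{i,j\})| + 1}. \tag{3.4}$$

We have $|S'| = |T'| + 1 \ge 3$ and $|S' \setminus (T' \cup \{i,j\})| + 1 \ge |S' \setminus T'| - 2 + 1$.
Therefore Equation (3.4) can be upper bounded by:

$$\le \frac{|T' \setminus S'|}{|S' \setminus T'| - 1} = 1 \tag{3.5}$$

where Equation (3.5) follows from $|S'| = |T'| + 1 \implies |S' \setminus T'| = |T' \setminus S'| + 1$.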
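Theorem 3.1 can be sanity-checked numerically on a single pair of vectors. The sketch below builds $b$ from $a$ by one "Robin Hood" transfer (moving mass from a larger entry to a smaller one, a standard way to obtain a vector majorized by $a$) and compares the ratios $S_{r+1}/S_r$. The function names and the specific vectors are assumptions for illustration; one example is of course not a proof.

```python
from fractions import Fraction


def elementary_symmetric(values, r):
    """S_r(values) by dynamic programming: e[j] is S_j of the values seen so far."""
    e = [Fraction(1)] + [Fraction(0)] * r
    for v in values:
        for j in range(r, 0, -1):
            e[j] += e[j - 1] * v
    return e[r]


def ratio(values, r):
    """S_{r+1}(values) / S_r(values)."""
    return elementary_symmetric(values, r + 1) / elementary_symmetric(values, r)


def robin_hood(values, i, j, t):
    """Move t from the larger entry i to the smaller entry j; the input majorizes the output."""
    out = list(values)
    if not (out[i] >= out[j] and 0 <= t <= (out[i] - out[j]) / 2):
        raise ValueError("transfer must not overshoot the midpoint")
    out[i] -= t
    out[j] += t
    return out


a = [Fraction(v) for v in (6, 3, 2, 1)]
b = robin_hood(a, 0, 3, Fraction(2))      # (4, 3, 2, 3), majorized by a
for r in (1, 2, 3):
    print(r, ratio(a, r), ratio(b, r), ratio(a, r) <= ratio(b, r))
```

The more uniform vector $b$ yields the larger ratio for each $r$, matching the intuition stated in Section 1.2.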



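We now use the Schur-concavity to prove our upper
bound on $\frac{S_{r+1}(\sigma)}{S_r(\sigma)}$.

Lemma 3.1. For any non-negative vector $\rho \in \mathbb{R}^n_{\ge 0}$ and
positive integers $k, r$ such that $r \ge k$:

$$\frac{S_{r+1}(\rho)}{S_r(\rho)} \le \frac{1}{r+1-k} \sum_{i \ge k+1} \rho_{\downarrow i}.$$

Proof. Note that, for any $\beta > 0$:

$$\frac{S_{r+1}(\beta\rho)}{S_r(\beta\rho)} = \frac{\beta^{r+1} S_{r+1}(\rho)}{\beta^{r} S_r(\rho)} = \beta\, \frac{S_{r+1}(\rho)}{S_r(\rho)}.$$

Thus without loss of generality, we may assume that
$\sum_i \rho_i = 1$. Further, we can assume that $\rho$ is sorted in
non-increasing order. Let $\alpha = \sum_{i \le k} \rho_i$. Consider the
following series $\rho'$:

$$\rho'_i = \begin{cases} \frac{1-\alpha}{n-k} & \text{if } i \ge k+1, \\ \frac{\alpha}{k} & \text{else.} \end{cases}$$

Since $\rho$ is sorted in non-increasing order, it is easy
to see that, for all $i$, we have $\rho'_i \ge \rho'_{i+1}$. We
have $(\rho'_1, \dots, \rho'_k) = \left(\frac{\alpha}{k}, \dots, \frac{\alpha}{k}\right) \prec (\rho_1, \dots, \rho_k)$ and
$(\rho'_{k+1}, \dots, \rho'_n) = \left(\frac{1-\alpha}{n-k}, \dots, \frac{1-\alpha}{n-k}\right) \prec (\rho_{k+1}, \dots, \rho_n)$.
Therefore $\rho' \prec \rho$, which implies:

$$\begin{aligned}
\frac{S_{r+1}(\rho)}{S_r(\rho)} \le \frac{S_{r+1}(\rho')}{S_r(\rho')}
&= \frac{\sum_{0 \le \ell \le k} \binom{k}{\ell} \binom{n-k}{r-\ell+1} \left(\frac{1-\alpha}{n-k}\right)^{r-\ell+1} \left(\frac{\alpha}{k}\right)^{\ell}}{\sum_{0 \le \ell \le k} \binom{k}{\ell} \binom{n-k}{r-\ell} \left(\frac{1-\alpha}{n-k}\right)^{r-\ell} \left(\frac{\alpha}{k}\right)^{\ell}} \\
&= \frac{\sum_{0 \le \ell \le k} \binom{k}{\ell} \frac{n-k-r+\ell}{r-\ell+1} \binom{n-k}{r-\ell} \left(\frac{1-\alpha}{n-k}\right)^{r-\ell+1} \left(\frac{\alpha}{k}\right)^{\ell}}{\sum_{0 \le \ell \le k} \binom{k}{\ell} \binom{n-k}{r-\ell} \left(\frac{1-\alpha}{n-k}\right)^{r-\ell} \left(\frac{\alpha}{k}\right)^{\ell}} \\
&\le \frac{n-r}{n-k} \cdot \frac{1-\alpha}{r-k+1} \\
&\le \frac{1}{r-k+1} (1-\alpha).
\end{aligned}$$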
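The inequality of Lemma 3.1 is easy to evaluate on small inputs. The following sketch compares both sides exactly for one vector and all admissible pairs $(r, k)$; the names `symmetric_ratio` and `tail_bound` and the test vector are assumptions chosen for illustration.

```python
from fractions import Fraction


def elementary_symmetric(values, r):
    """S_r(values) by dynamic programming: e[j] is S_j of the values seen so far."""
    e = [Fraction(1)] + [Fraction(0)] * r
    for v in values:
        for j in range(r, 0, -1):
            e[j] += e[j - 1] * v
    return e[r]


def symmetric_ratio(values, r):
    """Left-hand side of Lemma 3.1: S_{r+1} / S_r."""
    return elementary_symmetric(values, r + 1) / elementary_symmetric(values, r)


def tail_bound(values, r, k):
    """Right-hand side of Lemma 3.1: (sum of all but the k largest entries) / (r + 1 - k)."""
    ordered = sorted(values, reverse=True)
    return Fraction(sum(ordered[k:])) / (r + 1 - k)


rho = [6, 3, 2, 1]
for r in range(1, len(rho)):
    for k in range(1, r + 1):
        lhs, rhs = symmetric_ratio(rho, r), tail_bound(rho, r, k)
        print(f"r={r} k={k}: {lhs} <= {rhs} is {lhs <= rhs}")
```

Multiplying both sides by $r+1$ turns each printed comparison into an instance of inequality (1.3).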



4 Bounds on column reconstruction 

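We now present the upper bound relating the best
$r$-column reconstruction of a matrix $X$ to the error
$\|X - X_{(k)}\|_F$ of the best rank-$k$ approximation in the
Frobenius norm.

Theorem 4.1. For any $X \in \mathbb{R}^{m \times n}$ and positive integers
$r \ge k \ge 1$,

$$\min_{S \in \binom{[n]}{r}} \mathrm{Tr}(X^T X_S^{\perp} X) \le \mathbb{E}_{C \sim C_r(X)}\left[\mathrm{Tr}(X^T X_C^{\perp} X)\right] \le \frac{r+1}{r+1-k}\, \|X - X_{(k)}\|_F^2,$$

where $C \sim C_r(X)$ denotes sampling $C$ with probability
proportional to the determinant of $X_C^T X_C$, $|X_C^T X_C|$. In
other words, for any positive real $\epsilon > 0$,

$$\min_{S \in \binom{[n]}{k/\epsilon + k - 1}} \mathrm{Tr}(X^T X_S^{\perp} X) \le (1 + \epsilon)\, \|X - X_{(k)}\|_F^2.$$

Furthermore, for any $r = o(n)$, this bound is tight up
to lower order terms in the number of columns chosen:
There exists a matrix $X \in \mathbb{R}^{n \times n}$ such that

$$(1 + \epsilon - o(1))\, \|X - X_{(k)}\|_F^2 \le \min_{S \in \binom{[n]}{k/\epsilon}} \mathrm{Tr}(X^T X_S^{\perp} X).$$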

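Proof. The first bound is obvious since the minimum is
upper bounded by the average. For the second bound,
note that $\mathbb{E}_{C \sim C_r(X)}\left[\mathrm{Tr}(X^T X_C^{\perp} X)\right]$ is equal to

$$\begin{aligned}
\frac{\sum_{S \in \binom{[n]}{r}} |X_S^T X_S|\, \mathrm{Tr}(X^T X_S^{\perp} X)}{\sum_{S} |X_S^T X_S|}
&= \frac{\sum_{S} |X_S^T X_S| \sum_{u} \|X_S^{\perp} X_u\|^2}{\sum_{S} |X_S^T X_S|} \\
&= \frac{\sum_{S} \sum_{u} |X_{S \cup \{u\}}^T X_{S \cup \{u\}}|}{\sum_{S} |X_S^T X_S|} && \text{(using Lemma 2.2)} \\
&= \frac{(r+1) \sum_{T \in \binom{[n]}{r+1}} |X_T^T X_T|}{\sum_{S} |X_S^T X_S|} \\
&= (r+1)\, \frac{S_{r+1}(\sigma)}{S_r(\sigma)} && \text{(using Lemma 2.1)}
\end{aligned}$$

where $\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_n \ge 0$ are the eigenvalues
of $X^T X$. The claimed upper bound now follows by
applying the bound from Lemma 3.1 and recalling
$\|X - X_{(k)}\|_F^2 = \sum_{i \ge k+1} \sigma_i$.

Existence of $X$ follows from Lemma 6.2 given in
Section 6.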
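The chain of equalities in this proof can be checked on a toy matrix by brute force. The sketch below enumerates all $r$-subsets, weights each by $|X_C^T X_C|$, computes the residual $\mathrm{Tr}(X^T X_C^{\perp} X)$ independently by exact Gram-Schmidt, and compares the weighted average with $(r+1) S_{r+1}(X^T X)/S_r(X^T X)$ evaluated through principal minors. All helper names and the $3 \times 4$ example are assumptions for illustration; enumeration is exponential and is not the algorithm of this paper.

```python
from fractions import Fraction
from itertools import combinations


def det(mat):
    """Determinant by Gaussian elimination over exact fractions."""
    a = [[Fraction(v) for v in row] for row in mat]
    n = len(a)
    result = Fraction(1)
    for col in range(n):
        pivot = next((i for i in range(col, n) if a[i][col] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            result = -result
        result *= a[col][col]
        for i in range(col + 1, n):
            factor = a[i][col] / a[col][col]
            for c in range(col, n):
                a[i][c] -= factor * a[col][c]
    return result


def dot(u, v):
    return sum(Fraction(p) * q for p, q in zip(u, v))


def residual_sq(cols, x):
    """Squared distance from x to span(cols) by exact Gram-Schmidt (cols independent)."""
    basis = []
    for c in list(cols) + [x]:
        v = [Fraction(t) for t in c]
        for q in basis:
            coef = dot(v, q) / dot(q, q)
            v = [p - coef * s for p, s in zip(v, q)]
        basis.append(v)
    return dot(basis[-1], basis[-1])


def sym(G, r):
    """S_r(G): sum of the r x r principal minors of G."""
    return sum((det([[G[i][j] for j in U] for i in U])
                for U in combinations(range(len(G)), r)), Fraction(0))


def expected_residual(cols, r):
    """E over C ~ C_r(X) of Tr(X^T X_C^perp X), by enumerating all r-subsets."""
    G = [[dot(u, v) for v in cols] for u in cols]
    num = den = Fraction(0)
    for C in combinations(range(len(cols)), r):
        w = det([[G[i][j] for j in C] for i in C])    # |X_C^T X_C|
        if w == 0:
            continue                                   # dependent subsets have probability 0
        chosen = [cols[i] for i in C]
        num += w * sum(residual_sq(chosen, x) for x in cols)
        den += w
    return num / den


X_cols = [[1, 0, 0], [1, 1, 0], [0, 1, 1], [1, 0, 2]]   # columns of a 3 x 4 matrix
r = 2
G = [[dot(u, v) for v in X_cols] for u in X_cols]
lhs = expected_residual(X_cols, r)
rhs = (r + 1) * sym(G, r + 1) / sym(G, r)
print(lhs, rhs)
```

The two printed values should coincide, since the left one is the definition of the expectation and the right one is the closed form derived above.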

5 Fast volume sampling algorithm 

In this section, we describe and analyze our volume
sampling algorithm, which leads to the proof of Theorem 1.2.

Theorem 5.1. Given a matrix $X \in \mathbb{R}^{m \times n}$, $m \le n$,
and an integer $r$, Algorithm Volume-Sample($X$, $r$)
returns $C \in \binom{[n]}{r}$ with probability
$\frac{|X_C^T X_C|}{\sum_{T \in \binom{[n]}{r}} |X_T^T X_T|}$.
Furthermore, it can be implemented using at most
$O(rm^2 n)$ arithmetic operations.

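Algorithm 1 Volume-Sample($X$, $r$).

Input: $X \in \mathbb{R}^{m \times n}$ and positive integer $r$.

Output: $r$ columns of $X$, $C \in \binom{[n]}{r}$, chosen with probability proportional to $|X_C^T X_C|$: $C \sim C_r(X)$.
Procedure:

1. Let $C \leftarrow \emptyset$. Initialize the table $\mathcal{T}$ of the $n$ outer products $X_{[\ell,n]} X_{[\ell,n]}^T$, $\ell \in [n]$.

2. Choose $\tau$ uniformly at random from $[0, 1]$.

3. $\tau \leftarrow \tau \cdot S_r(X^T X)$.

4. For $i \leftarrow 1$ to $r$:

   (a) $\ell \leftarrow 1$, $u \leftarrow n$.

   (b) While $\ell \ne u$:

       i. $m \leftarrow \lfloor \frac{\ell + u}{2} \rfloor$.

       ii. $h \leftarrow S_r(X_{[\ell,n]}^T X_{[\ell,n]}) - S_r(X_{[m+1,n]}^T X_{[m+1,n]})$, which is equal to
           $S_r(X_{[\ell,n]} X_{[\ell,n]}^T) - S_r(X_{[m+1,n]} X_{[m+1,n]}^T)$, using $\mathcal{T}$.

       iii. If $\tau > h$, then $\tau \leftarrow \tau - h$, $\ell \leftarrow m + 1$.

       iv. Else $u \leftarrow m$.

   (c) $C \leftarrow C \cup \{\ell\}$, $X \leftarrow X_{\ell}^{\perp} X$ and update the table $\mathcal{T}$ of outer products.

5. Return $C$.

Proof of Correctness. For correctness, notice that for $C$
sampled with probability proportional to $|X_C^T X_C|$, if we let
$C = \{i_1 < i_2 < \dots < i_r\}$, then the probability that the lowest index $i_1$ equals $\ell$ is

$$\frac{\|X_\ell\|^2\, S_{r-1}\!\left(X_{[\ell+1,n]}^T X_\ell^{\perp} X_{[\ell+1,n]}\right)}{S_r(X^T X)}.$$

Notice that the algorithm, when it exits the while
loop for the first time, chooses each $\ell$ with probability

$$\frac{S_r(X_{[\ell,n]}^T X_{[\ell,n]}) - S_r(X_{[\ell+1,n]}^T X_{[\ell+1,n]})}{S_r(X^T X)}
= \frac{\|X_\ell\|^2\, S_{r-1}\!\left(X_{[\ell+1,n]}^T X_\ell^{\perp} X_{[\ell+1,n]}\right)}{S_r(X^T X)},$$

which completes the proof.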
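The binary-search step for the lowest index can be sketched as follows. This is only an illustrative reconstruction under simplifying assumptions: it works on the Gram matrix $G = X^T X$ with 0-based indices, evaluates $S_r$ by brute-force enumeration of principal minors instead of the $O(m^{\omega} \log m)$ routine and the outer-product table used in the paper, and covers a single round (choosing the first column only). The names `sym_from` and `lowest_index` are hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def det(mat):
    """Determinant by Gaussian elimination over exact fractions."""
    a = [[Fraction(v) for v in row] for row in mat]
    n = len(a)
    result = Fraction(1)
    for col in range(n):
        pivot = next((i for i in range(col, n) if a[i][col] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            result = -result
        result *= a[col][col]
        for i in range(col + 1, n):
            factor = a[i][col] / a[col][col]
            for c in range(col, n):
                a[i][c] -= factor * a[col][c]
    return result


def sym_from(G, r, start):
    """S_r of the Gram submatrix on columns start..n-1 (brute force, for illustration)."""
    idx = range(start, len(G))
    return sum((det([[G[i][j] for j in U] for i in U])
                for U in combinations(idx, r)), Fraction(0))


def lowest_index(G, r, tau):
    """Binary search for the lowest index (0-based) of a volume sample.

    tau must lie in [0, S_r(G)). Index ell owns a slice of mass
    S_r(G[ell:]) - S_r(G[ell+1:]), the total weight of r-subsets whose minimum is ell.
    """
    total = sym_from(G, r, 0)
    lo, hi = 0, len(G) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        mass_up_to_mid = total - sym_from(G, r, mid + 1)   # subsets with minimum <= mid
        if tau < mass_up_to_mid:
            hi = mid
        else:
            lo = mid + 1
    return lo


# Gram matrix of the columns (1,0,0), (1,1,0), (0,1,1), (1,0,2)
G = [[1, 1, 0, 1], [1, 2, 1, 1], [0, 1, 2, 2], [1, 1, 2, 5]]
print([lowest_index(G, 2, tau) for tau in range(25)])
```

Drawing `tau` uniformly from $[0, S_r(G))$ then selects each lowest index with the marginal probability used in the proof of correctness; only $O(\log n)$ evaluations of $S_r$ are needed per chosen column.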

Proof of Running Time. We assume each elementary
arithmetic operation takes unit time.

By [2, Section 16.6], we can compute
$S_r(X_{[\ell,n]}^T X_{[\ell,n]}) = S_r(X_{[\ell,n]} X_{[\ell,n]}^T)$ in time $O(m^{\omega} \log m)$
given the outer product $X_{[\ell,n]} X_{[\ell,n]}^T$. Since
$X_{A \cup B} X_{A \cup B}^T = X_A X_A^T + X_B X_B^T$ for disjoint $A$ and $B$, we can compute
the table $\mathcal{T}$ of all the $n$ outer products $X_{[\ell,n]} X_{[\ell,n]}^T$,
for $\ell \in [n]$, in time $O(m^2 n)$. Also, given $X_i$, if we let
$z = \frac{X_i}{\|X_i\|}$, then

$$(X_i^{\perp} X_S)(X_i^{\perp} X_S)^T = X_S X_S^T + z z^T (z^T X_S X_S^T z) - z z^T X_S X_S^T - X_S X_S^T z z^T.$$

Hence, after choosing some column $\ell$, we can update
each outer product matrix in the table $\mathcal{T}$ in $O(m^2)$ time.
Since there are at most $n$ matrices in this table, each
update step takes $O(m^2 n)$ time.
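The update identity above can be verified directly. The sketch below applies it with $O(m^2)$ work to one outer-product matrix $M = X_S X_S^T$ and compares the result with the explicit product $(I - zz^T) M (I - zz^T)$. To stay in exact arithmetic it uses the unnormalized column $x$ and divides by $\|x\|^2$ instead of forming $z$; the function names and the example data are assumptions for illustration.

```python
from fractions import Fraction


def matmul(A, B):
    """Plain matrix product of two lists-of-rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*B)] for row in A]


def outer_table_update(M, x):
    """(I - zz^T) M (I - zz^T) for symmetric M and z = x/||x||, using O(m^2) operations."""
    m = len(x)
    nx = sum(Fraction(t) * t for t in x)                              # ||x||^2
    Mx = [sum(M[i][j] * x[j] for j in range(m)) for i in range(m)]    # M x
    xMx = sum(x[i] * Mx[i] for i in range(m))                         # x^T M x
    return [[M[i][j]
             + x[i] * x[j] * xMx / (nx * nx)     # zz^T (z^T M z)
             - x[i] * Mx[j] / nx                 # zz^T M
             - Mx[i] * x[j] / nx                 # M zz^T
             for j in range(m)] for i in range(m)]


def project_both_sides(M, x):
    """Reference computation: build P = I - xx^T/||x||^2 and multiply out P M P."""
    m = len(x)
    nx = sum(Fraction(t) * t for t in x)
    P = [[(1 if i == j else 0) - x[i] * x[j] / nx for j in range(m)] for i in range(m)]
    return matmul(matmul(P, M), P)


cols = [[1, 1, 0], [0, 1, 1], [1, 0, 2]]                 # columns of X_S
M = [[sum(c[i] * c[j] for c in cols) for j in range(3)] for i in range(3)]
x = [1, 2, 2]                                            # the chosen column X_i
fast = outer_table_update(M, x)
slow = project_both_sides(M, x)
print(fast == slow)
```

The fast version touches each entry of $M$ a constant number of times, which is the source of the $O(m^2)$ cost per table entry claimed in the text.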

For each column we choose, we evaluate at most
$O(\log n)$ many symmetric functions $S_r$. Thus choosing
one column takes time $O(m^{\omega} \log m \log n)$ given the table
$\mathcal{T}$. Since we choose $r$ columns, the total amount of time,
including the time to initialize and update $\mathcal{T}$ in each
iteration, is bounded by

$$O\left(r m^{\omega} \log m \log n + r m^2 n\right) = O\left(r m^2 \left(m^{\omega-2} \log m \log n + n\right)\right).$$

Since $m^{\omega-2} \log m \log n \le \sqrt{n} \log^2 n = o(n)$, this bound
becomes $O(r m^2 n)$.

The claim in Theorem 1.2 about the performance
of Algorithm Volume-Sample as a column-selection
algorithm follows from the upper bound on
$\mathbb{E}_{C \sim C_r(X)}\left[\mathrm{Tr}(X^T X_C^{\perp} X)\right]$ in Theorem 4.1.



6 Lower bound for column-selection 

In this section, we construct matrices for given $k$ and
$r$ for which the upper bound stated in Theorem 4.1 is
nearly tight. Our construction is in fact the same as the
one given by Deshpande and Vempala [6]. Our analysis
is different and shows a lower bound on the quantity
$\mathrm{Tr}(X^T X_S^{\perp} X)$ where the full column span of the chosen
$r$ columns is allowed for approximating $X$.

Definition 6.1. Given $\delta > 0$ and $m$, we define
$M^{(m,\delta)} \in \mathbb{R}^{m \times m}$ as

$$M^{(m,\delta)} = \delta I + J,$$

where $I$ is the identity matrix of dimension $m$, and $J$ is
the all 1's $m \times m$ matrix.

Observation 6.1. Given any $\delta > 0$ and positive integer
$m$, the following hold for the matrix $M^{(m,\delta)}$:

1. $\mathrm{Tr}(M^{(m,\delta)}) = m(1 + \delta)$.

2. Its largest eigenvector is the all 1's vector, with
corresponding eigenvalue $\sigma_1 = \delta + m$. The rest of the
eigenvalues all have value $\sigma_2 = \sigma_3 = \dots = \sigma_m = \delta$.

3. $|M^{(m,\delta)}| = \prod_{i=1}^{m} \sigma_i = \delta^{m} + m\delta^{m-1}$.



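Lemma 6.1. Given any $\delta > 0$ and positive integer $r$,
for $n \ge r$, if we let $X^T X = M^{(n,\delta)}$, then for any $S \in \binom{[n]}{r}$,

$$\frac{\mathrm{Tr}(X^T X_S^{\perp} X)}{\|X - X_{(1)}\|_F^2} \ge \frac{n-r}{n-1} \left(1 + \frac{1}{\delta + r}\right).$$

Proof. Note that $\|X - X_{(1)}\|_F^2 = \sum_{i \ge 2} \sigma_i = (n-1)\delta$.
For any subset $C \subseteq [n]$ of size $|C| = r$, the
corresponding minor of $X^T X$ is given by
$X_C^T X_C = M^{(r,\delta)}$, so that $|X_C^T X_C| = \delta^{r} + r\delta^{r-1}$.
Consequently, for $i \notin C$,

$$\|X_C^{\perp} X_i\|^2 = \frac{|X_{C \cup \{i\}}^T X_{C \cup \{i\}}|}{|X_C^T X_C|} = \frac{\delta^{r} (\delta + (r+1))}{\delta^{r-1} (\delta + r)} = \delta \left(1 + \frac{1}{\delta + r}\right).$$

In particular,

$$\mathrm{Tr}(X^T X_C^{\perp} X) = (n-r)\, \delta \left(1 + \frac{1}{\delta + r}\right).$$

Therefore

$$\frac{\mathrm{Tr}(X^T X_S^{\perp} X)}{\|X - X_{(1)}\|_F^2} = \frac{n-r}{n-1} \left(1 + \frac{1}{\delta + r}\right).$$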

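Lemma 6.2. For any positive integer $n$ and positive
integers $k$ and $r$, $r \ge k$, such that $r = o(n)$, there exists
a matrix of size $n \times n$, $X \in \mathbb{R}^{n \times n}$, for which the following
holds:

$$\min_{S \in \binom{[n]}{r}} \frac{\mathrm{Tr}(X^T X_S^{\perp} X)}{\|X - X_{(k)}\|_F^2} \ge 1 + \frac{k}{r} - o(1).$$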



Proof. We will fix $\delta$ to be an infinitesimally small
number, $\delta = o(1)$.

For $n = n_0 k$ with $n_0 \ge r + 1$, let $X$ be chosen so that
$X^T X$ is a block diagonal matrix of size $n \times n = n_0 k \times n_0 k$
with $k$ copies of $M^{(n_0,\delta)}$ on its diagonal:

$$X^T X = \begin{pmatrix} M^{(n_0,\delta)} & 0^{(n_0)} & \cdots & 0^{(n_0)} \\ \vdots & & \ddots & \vdots \\ 0^{(n_0)} & \cdots & & M^{(n_0,\delta)} \end{pmatrix} = I^{(k)} \otimes M^{(n_0,\delta)}$$

where we used $0^{(m)}$ and $I^{(m)}$ to denote matrices of size
$m \times m$ consisting of all zeroes and identity respectively.
Here $\otimes$ denotes the tensor (Kronecker) product. By a
property of tensoring [S], $X^T X$ has $k$ copies of each
eigenvalue of $M^{(n_0,\delta)}$. In particular,

$$\|X - X_{(k)}\|_F^2 = n(1 + \delta) - n - k\delta = (n - k)\delta. \tag{6.6}$$

We will use $[k] \times [n_0]$ to index the columns of the matrix
$X$, so that for any $i \in [k]$, if we let $X^{(i)} = X_{\{i\} \times [n_0]}$,
we have ${X^{(i)}}^T X^{(i)} = M^{(n_0,\delta)}$, and for any $i \ne j \in [k]$,
${X^{(i)}}^T X^{(j)} = 0^{(n_0)}$.

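Proceeding as in [6], given $S$, let $S_i$ be the set of
columns chosen from the $i$-th block, so that $S_i = \{j \in [n_0] \mid (i,j) \in S\}$.
It is easy to see that

$$\mathrm{Tr}\left({X^{(i)}}^T X_S^{\perp} X^{(i)}\right) = \mathrm{Tr}\left({X^{(i)}}^T \left(X^{(i)}_{S_i}\right)^{\perp} X^{(i)}\right) \ge \delta (n_0 - |S_i|) \left(1 + \frac{1}{\delta + |S_i|}\right)$$

where we used Lemma 6.1. Therefore

$$\mathrm{Tr}(X^T X_S^{\perp} X) = \sum_{i} \mathrm{Tr}\left({X^{(i)}}^T X_S^{\perp} X^{(i)}\right) \ge \sum_{i} \delta (n_0 - |S_i|) \left(1 + \frac{1}{\delta + |S_i|}\right). \tag{6.7}$$

Note that $(n_0 - x)\left(1 + \frac{1}{\delta + x}\right)$ is convex as long as
$x + \delta > 0$. Therefore we can use Jensen's inequality and
lower bound the expression in (6.7) by

$$\delta k \left(n_0 - \frac{r}{k}\right) \left(1 + \frac{1}{\delta + r/k}\right) = \delta (n - r) \left(1 + \frac{k}{\delta k + r}\right).$$

Recalling the bound (6.6) for the best rank-$k$ approximation,
we see that for any $S$ with $|S| = r = o(n)$ and
$\delta = o(1)$:

$$\frac{\mathrm{Tr}(X^T X_S^{\perp} X)}{\|X - X_{(k)}\|_F^2} \ge \frac{n-r}{n-k} \left(1 + \frac{k}{r}(1 - o(1))\right) \ge 1 + \frac{k}{r} - o(1).$$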
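To see how close the lower and upper bounds are for concrete parameters, the following sketch evaluates the expression $\frac{n-r}{n-k}\left(1 + \frac{k}{\delta k + r}\right)$ obtained from (6.6) and the Jensen step, next to the upper bound $\frac{r+1}{r+1-k}$ of Theorem 4.1. The function names and the parameter values ($k = 2$, $r = 4$, $n_0 = 1000$, $\delta = 10^{-3}$) are assumptions for illustration.

```python
from fractions import Fraction


def lower_ratio(n0, k, r, delta):
    """Lower bound on the ratio for the block construction with n = n0 * k columns."""
    n = n0 * k
    return Fraction(n - r, n - k) * (1 + Fraction(k) / (Fraction(delta) * k + r))


def upper_ratio(r, k):
    """Upper bound (r+1)/(r+1-k) of Theorem 4.1."""
    return Fraction(r + 1, r + 1 - k)


k, r = 2, 4
low = lower_ratio(1000, k, r, Fraction(1, 1000))
high = upper_ratio(r, k)
print(float(low), float(1 + Fraction(k, r)), float(high))
```

As $n_0$ grows and $\delta$ shrinks, the first value approaches $1 + k/r$, while the upper bound is $1 + \frac{k}{r+1-k}$; the two differ only in lower order terms of the number of columns.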